gh-148762: Speed up multiline regexes anchored by ^ #152339

Open: haampie wants to merge 4 commits into python:main from haampie:hs/fix/multiline-caret

@haampie (Contributor) commented Jun 26, 2026

Multiline regexes of the form re.compile("^foo", re.MULTILINE) currently
fall into the generic search loop, which calls SRE(match) at every
position in the subject string. Since a ^-anchored (SRE_AT_BEGINNING_LINE)
pattern can only match at the start of the string or right after a linebreak,
we can instead jump from one line start to the next, skipping all the
intermediate positions.
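
For illustration only, the idea can be sketched in Python. This is not the C code in this PR: `line_starts` and `anchored_spans` are hypothetical helper names, and the sketch assumes the part of the pattern after `^` can simply be tried with `match()` at each candidate position.

```python
import re

def line_starts(s, pos=0, endpos=None):
    """Yield each index in s[pos:endpos] where a MULTILINE '^' can match."""
    if endpos is None:
        endpos = len(s)
    # The start position qualifies only if it is the real start of the
    # string or directly follows a newline.
    if pos == 0 or s[pos - 1] == "\n":
        yield pos
    # Jump from newline to newline instead of visiting every index.
    nl = s.find("\n", pos, endpos)
    while nl != -1:
        yield nl + 1
        nl = s.find("\n", nl + 1, endpos)

def anchored_spans(body, s):
    """Spans of '^' + body in MULTILINE mode, trying line starts only.

    Returns (spans, attempts) so the number of match attempts is visible.
    """
    prog = re.compile(body)
    spans, attempts, last_end = [], 0, 0
    for start in line_starts(s):
        if start < last_end:
            continue  # this line start lies inside the previous match
        attempts += 1
        m = prog.match(s, start)
        if m:
            spans.append(m.span())
            last_end = m.end()
    return spans, attempts

text = "a\nba\na"
spans, attempts = anchored_spans("a", text)
reference = [m.span() for m in re.compile("^a", re.MULTILINE).finditer(text)]
print(spans == reference, attempts, len(text) + 1)
```

In this example the sketch makes 3 match attempts (one per line start), where a scan over every position would make up to 7.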

Benchmarks show good improvements in runtime across UCS-1/2/4; full
numbers are in the issue.

haampie added 3 commits June 26, 2026 19:53
Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>
Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>
Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>

@eendebakpt (Contributor) left a comment

Claude suggests adding some tests for coverage. Not sure we need all of them, but including in details here for reference.

<details>
<summary>Details</summary>

```python
def test_search_anchor_at_beginning_line(self):
    # gh-148762: a multiline "^" search jumps between line starts. These
    # cases pin the behaviour the optimization must preserve.
    for pattern, cases in [
        ('^', [
            ('', [(0, 0)]),
            ('abc', [(0, 0)]),
            ('\n', [(0, 0), (1, 1)]),
            ('\n\n', [(0, 0), (1, 1), (2, 2)]),
            ('a\n', [(0, 0), (2, 2)]),  # match at end after \n
            ('\na', [(0, 0), (1, 1)]),
            ('a\nb\nc', [(0, 0), (2, 2), (4, 4)]),
            ('a\n\nb', [(0, 0), (2, 2), (3, 3)]),  # empty line
            ('\n\n\n', [(0, 0), (1, 1), (2, 2), (3, 3)]),
        ]),
        ('^a', [
            ('a', [(0, 1)]),
            ('a\na', [(0, 1), (2, 3)]),
            ('a\nba\na', [(0, 1), (5, 6)]),
            ('ba\nab', [(3, 4)]),
            ('a\n', [(0, 1)]),  # no match-at-end: needs 'a'
            ('\na', [(1, 2)]),
            ('aa\naa', [(0, 1), (3, 4)]),
            ('a\n\na', [(0, 1), (3, 4)]),
            ('a\nĀa\na', [(0, 1), (5, 6)]),  # UCS2 string kind
            ('Ā\na\nĀ', [(2, 3)]),
            ('a\n\U0001F600a\na', [(0, 1), (5, 6)]),  # UCS4 string kind
            ('\U0001F600\na', [(2, 3)]),
        ]),
    ]:
        p = re.compile(pattern, re.MULTILINE)
        for s, expected in cases:
            with self.subTest(pattern=pattern, string=s):
                self.assertEqual([m.span() for m in p.finditer(s)], expected)

    # bytes (8-bit) path
    pb = re.compile(b'^a', re.MULTILINE)
    for s, expected in [(b'a\nba\na', [(0, 1), (5, 6)]), (b'a\n', [(0, 1)]),
                        (b'\na', [(1, 2)]), (b'abc', [(0, 1)])]:
        with self.subTest(string=s):
            self.assertEqual([m.span() for m in pb.finditer(s)], expected)

    # pos / endpos: the search may begin mid-line or on a line start
    pa = re.compile('^a', re.MULTILINE)
    self.assertEqual([m.span() for m in pa.finditer('xa\na', 1)], [(3, 4)])
    self.assertEqual([m.span() for m in pa.finditer('a\na', 2)], [(2, 3)])
    self.assertEqual([m.span() for m in pa.finditer('a\na\na', 1, 3)], [(2, 3)])
    self.assertEqual([m.span() for m in pa.finditer('a\na', 0, 1)], [(0, 1)])

    # sub / subn / split also drive search()
    pc = re.compile('^', re.MULTILINE)
    self.assertEqual(pc.sub('#', 'a\nb\nc'), '#a\n#b\n#c')
    self.assertEqual(pc.sub('#', 'a\nb\n'), '#a\n#b\n#')
    self.assertEqual(pc.subn('#', 'a\nb\n'), ('#a\n#b\n#', 3))
    self.assertEqual(pc.split('a\nb'), ['', 'a\n', 'b'])
    self.assertEqual(pc.split('a\nb\n'), ['', 'a\n', 'b\n', ''])
```

</details>

Comment thread Modules/_sre/sre_lib.h
Comment on lines +1864 to +1867
while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
    ptr++;
if (ptr >= end)
    return 0;

This could be

Suggested change
- while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
-     ptr++;
- if (ptr >= end)
-     return 0;
+ #if SIZEOF_SRE_CHAR == 1
+ ptr = memchr(ptr, '\n', end - ptr);
+ if (ptr == NULL)
+     return 0;
+ #else
+ while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
+     ptr++;
+ if (ptr >= end)
+     return 0;
+ #endif

(I did not benchmark, not sure it is worth the change)

@haampie (Contributor, Author) commented

I had another issue/PR where I tried something like this, but it's hard to optimize when you don't know the distribution of \n characters in the "haystack". See #148729 (comment); on the MacBook it was hard to beat a hand-written loop:

> The regression on darwin is because the letter i has a density of 2.88% in the corpus; the cross-over density is apparently about 2%, below which memchr is faster.

Based on wc -cl $(find -type f -name '*.py') run on CPython's own sources, there are 988799 lines and 36153429 bytes, which gives a newline density of 2.7%. That is above the cross-over, so on my MacBook memchr would likely be slightly worse. But it depends on the use case.

So, I would hold off on combining different types of optimizations. This PR is about reducing the number of expensive match function calls.
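
For reference, the density arithmetic above can be reproduced with a few lines of Python. This is only a sketch: `newline_density` is a hypothetical helper, and `CROSSOVER` is the roughly 2% figure quoted above for one macOS machine, not a general constant.

```python
def newline_density(data: bytes) -> float:
    """Fraction of bytes in data that are b'\\n' (0.0 for empty input)."""
    return data.count(b"\n") / len(data) if data else 0.0

# Counts reported by wc -cl over CPython's *.py files (from the comment above).
lines, total_bytes = 988_799, 36_153_429
density = lines / total_bytes

# Assumption: cross-over density observed on one macOS machine only.
CROSSOVER = 0.02

print(f"newline density: {density:.1%}")
print("above cross-over:", density > CROSSOVER)
```

The same helper can be pointed at any representative corpus, for example `newline_density(open(path, "rb").read())`, to see which side of the cross-over a given workload falls on.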
